Unsupervised clustering, also known as natural clustering, is the classification of data
according to their similarities. Here we study this problem from the perspective of complex
networks. Mapping the description of data similarities to graphs, we propose to extend two
multiresolution modularity-based algorithms to the detection of modules (clusters) in general data
sets, producing a multiscale solution. We apply these algorithms to the classification of a standard
data clustering benchmark and compare their performance.
1. Introduction

The problem of unsupervised data clustering consists in classifying elements so that two data points
belonging to the same cluster are more similar to each other than to elements in a different cluster. An
element, or pattern, is a vector of features (usually understood as a point in a multidimensional space) that
describes the item we wish to classify. The goal of data clustering is to organize these patterns
by finding a partition of the sample according to the natural classes present in it. Data clustering has been
a subject of interest in many disciplines where the mining of raw information is crucial to understand some
phenomenon or gain insight into a system. It has applications in several fields such as pattern recognition,
astronomic classification, biological taxonomy, marketing, and more [Gan et al., 2007].

The methodology used to obtain the clusters from the raw data is as follows. First, a representation
of the patterns has to be chosen, and a feature selection or extraction is performed. Feature selection
means choosing, from all the available features, those that will make the clustering process easier,
leaving the redundant, correlated and less informative features out of the analysis. Feature extraction,
on the other hand, consists in transforming the original data set into a new one containing only the most
relevant information. This first step is very important, as the result of the clustering often depends directly
on its quality. Secondly, the similarity or dissimilarity between each pair of patterns has to be computed,
which is often done by defining a measure of distance. The result of this step is the similarity matrix, which,
using the mapping to complex networks, can be understood as a graph where each node is a pattern and
the links are the similarities between them [Jain et al., 1999]. Finally comes the main step of the process, the
grouping (or clustering) algorithm, which decomposes the similarity matrix and returns the groups of
data.

The problem of clustering is inherently ill-posed, i.e. any data set can be clustered in drastically
different ways, with no clear criterion for preferring one clustering over another. In particular, in the case of
unsupervised approaches, a satisfactory clustering of data depends on the desired resolution, which
determines the number of clusters and their size. For example, k-means clustering fixes a priori the number of
groups (k), which implies a certain resolution of the clustering. Other algorithms such as hierarchical
clustering [Kaufman & Rousseeuw] group the patterns by extending the measure of distance between
them to distances between clusters of patterns. This process generates a complete dendrogram. Cutting the
dendrogram at different heights we obtain different partitions of the data, all of them hierarchically nested.
In this situation the following question arises: at what resolution should one look at the data to find a
scientific meaning in the classification? We claim that the answer to this question depends entirely on
the final purpose of the classification process, and that the concept of best solution should be reconsidered.
Different partitions are representative of properties of the data at different scales, and therefore all of them
are worth studying.

In this work we compare two multiresolution algorithms, used in the
field of complex networks to detect community structure, applied to the problem of data clustering. We also
compare our results with a hierarchical clustering (HC) algorithm. In contrast with hierarchical clustering,
the multiresolution methods are not necessarily hierarchical. The first algorithm is a multiresolution
static screening of the topology of the network, based on the introduction of a control parameter in the
resolution of modularity [Arenas et al., 2008] (AFG method), proposed by the authors. The second one is
a multiresolution dynamic screening of the network structure using a method, inspired by the Potts model,
proposed by [Reichardt & Bornholdt, 2004] (RB method). Both algorithms prove to be competitive with
classical clustering methods in the classification of the Iris data set.



2. The complex networks approach 

Complex networks are graphs representative of the intricate connections between elements in many natural
and artificial systems [Strogatz, 2001; Song et al., 2005; Barabási, 2005], whose description in terms of
statistical properties has been largely developed in the search for a universal classification of them. However,
when the networks are analyzed locally, some characteristics that are partially hidden in the statistical
description emerge. Perhaps the most relevant is the discovery in many of them of community structure,
meaning the existence of densely (or strongly) connected groups of nodes, with sparse (or weak) connections
between them [Girvan & Newman, 2002].

The study of the community structure helps to elucidate the organization of the networks and,
eventually, could be related to the functionality of groups of nodes [Guimerà & Amaral, 2005b]. The most
successful solutions to the community detection problem, in terms of accuracy, are those based on the
optimization of a quality function called modularity, proposed by Newman and Girvan [Newman & Girvan,
2004], which allows the comparison of different partitions of the network. Given a network partitioned into
communities, with $C_i$ the community to which node $i$ is assigned, the mathematical definition of
modularity is expressed in terms of the weighted adjacency matrix $w_{ij}$, which represents the value of the weight
of the link between nodes $i$ and $j$ (this weight is 0 if no link exists), and the strengths $w_i = \sum_j w_{ij}$,
as [Newman, 2004a]

$$Q = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j) , \tag{1}$$

where the Kronecker delta function $\delta(C_i, C_j)$ takes the value 1 if nodes $i$ and $j$ are in the same community
and 0 otherwise, and the total strength is $2w = \sum_i w_i$. The modularity of a given partition is then the probability
of having edges falling within groups in the network minus the expected probability in an equivalent (null
case) network with the same number of nodes, and edges placed at random preserving the nodes' strength.
The larger the modularity, the better the partitioning, because it deviates more from the null case. Note that
the optimization of the modularity cannot be performed by exhaustive search, since the number of different
partitions is equal to the Bell or exponential numbers [Bell, 1934], which grow at least exponentially in the
number of nodes $N$. Indeed, optimization of modularity is an NP-hard (Non-deterministic Polynomial-time
hard) problem [Brandes et al., 2008]. Several authors have attacked the problem, with considerable success,
by proposing different optimization heuristics [Newman, 2004b; Clauset et al., 2004; Guimerà & Amaral,
2005a; Duch & Arenas, 2005; Pujol et al., 2006; Newman, 2006]; see [Fortunato, 2010] for a review.
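The modularity defined above can be evaluated directly from the weight matrix and a list of community labels. The following is a minimal sketch, assuming a symmetric matrix of non-negative weights stored as a NumPy array; the function name `modularity` and the toy network of two triangles joined by one link are illustrative choices, not taken from the paper.

```python
import numpy as np

def modularity(W, labels):
    """Weighted modularity Q of a partition of an undirected network.

    W      : symmetric (N, N) array of non-negative weights, 0 where no link
    labels : length-N sequence, labels[i] is the community C_i of node i
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    strengths = W.sum(axis=1)                      # w_i = sum_j w_ij
    total = strengths.sum()                        # 2w = sum_i w_i
    same = labels[:, None] == labels[None, :]      # Kronecker delta(C_i, C_j)
    null = np.outer(strengths, strengths) / total  # null-case term w_i w_j / 2w
    return ((W - null) * same).sum() / total

# Toy network (hypothetical): two triangles joined by a single link.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

q_split = modularity(W, [0, 0, 0, 1, 1, 1])  # one community per triangle
q_whole = modularity(W, [0, 0, 0, 0, 0, 0])  # a single community
print(q_split, q_whole)
```

For this toy network the split into the two triangles gives $Q = 12/14 - 1/2 = 5/14$, while the single-community partition gives $Q = 0$, as expected for any network.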

Maximizing modularity one obtains the "best" partition of the network into communities. This
partition represents an intermediate topological scale of organization, or mesoscale, that in many cases has been
shown to coincide with known information about subdivisions in the network [Newman & Girvan, 2004;
Danon et al., 2005]. However, it has recently been pointed out that the optimization of modularity has
a characteristic scale, related to the number of links in the network, which delimits the resolution beyond
which no separation into smaller groups can be obtained when optimizing modularity, even though these
smaller partitions, and hence different levels of description, plausibly exist from direct observation
[Fortunato & Barthélemy, 2007]. The problem seems then to be that modularity, as it has been prescribed, does
not have access to these other levels of description. The reason is that the topological scale to
which we have access by maximizing modularity has a topological resolution limit.

We proposed a method that allows the full screening of the topological structure at any resolution
level using the original formulation and semantics of modularity, thus overcoming the resolution limit
[Arenas et al., 2008]. Our aim is to take advantage of this method to analyze real data sets in terms of
clustering. In contrast with the solution proposed in [Angelini et al., 2007] to find the correct clustering
using modularity, here we present a multiple scale method, also based on the optimization of modularity.
The mathematical form of our prescription is given by $Q_{AFG}(r) = Q(w_{ij} \to w_{ij} + r\delta_{ij})$, where $r$ (resistance)
is the parameter controlling the resolution of the partitions we want to find, and $w_{ij} + r\delta_{ij}$ is the new
weight matrix after adding a self-loop with value $r$ to each node. The definition of $Q_{AFG}$ preserves
the original semantics of modularity.
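The self-loop prescription amounts to evaluating ordinary modularity on the shifted weight matrix. Below is a minimal sketch under the assumption of non-negative weights and $r \geq 0$ (the signed case is treated in the Appendix); the helper names `modularity` and `q_afg` and the toy network are hypothetical.

```python
import numpy as np

def modularity(W, labels):
    """Weighted modularity for a symmetric, non-negative weight matrix."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    strengths = W.sum(axis=1)
    total = strengths.sum()
    same = labels[:, None] == labels[None, :]
    return ((W - np.outer(strengths, strengths) / total) * same).sum() / total

def q_afg(W, labels, r):
    """Modularity after adding a self-loop of weight r to every node."""
    W = np.asarray(W, dtype=float)
    return modularity(W + r * np.eye(len(W)), labels)

# Toy network (hypothetical): two triangles joined by a single link.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

triangles = [0, 0, 0, 1, 1, 1]
singletons = [0, 1, 2, 3, 4, 5]
for r in (0.0, 10.0):
    print(r, q_afg(W, triangles, r), q_afg(W, singletons, r))
```

In this example the two-triangle partition scores higher at $r = 0$, whereas at $r = 10$ the partition into isolated nodes scores higher, which illustrates how increasing $r$ shifts the optimum towards finer partitions.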

A different approach was proposed by Reichardt and Bornholdt [Reichardt & Bornholdt, 2004]. In
their work every node can be understood as a dynamical system of oscillators of $q$ states (usually known as
the Potts model), and the partition into modules is equivalent to the ground state of this dynamical
system. Indeed, the authors made a very interesting connection between the statistical mechanics of the Potts
model and modularity. Moreover, although the resolution limit was discovered later, the RB
method already solved this problem by the tuning of a parameter, as pointed out in [Kumpula et al., 2007].
The result is that the ground state of the system, corresponding to the minimum of its Hamiltonian, can be
written as

$$Q_{RB}(\gamma) = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \gamma \frac{w_i w_j}{2w} \right) \delta(C_i, C_j) ,$$

where $\gamma$ is the resolution control parameter in this case. Note that the original $Q$ corresponds to $\gamma = 1$,
whereas other values define different quality functions characterized by the weight given to the null model term.
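The role of $\gamma$ as a weight on the null-model term can be made concrete with a short sketch. It assumes non-negative weights; the function name `q_rb` and the toy network are illustrative and not part of the original work.

```python
import numpy as np

def q_rb(W, labels, gamma):
    """RB quality function: modularity with the null-case term scaled by gamma."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    strengths = W.sum(axis=1)                      # w_i
    total = strengths.sum()                        # 2w
    same = labels[:, None] == labels[None, :]      # delta(C_i, C_j)
    null = np.outer(strengths, strengths) / total  # w_i w_j / 2w
    return ((W - gamma * null) * same).sum() / total

# Toy network (hypothetical): two triangles joined by a single link.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

whole = [0] * 6
triangles = [0, 0, 0, 1, 1, 1]
singletons = list(range(6))
for gamma in (0.0, 1.0, 5.0):
    print(gamma, [q_rb(W, p, gamma) for p in (whole, triangles, singletons)])
```

At $\gamma = 0$ the null-model term vanishes and the single-community partition is favored; at $\gamma = 1$ the usual modularity is recovered; for large $\gamma$ the partition into isolated nodes scores highest among the three candidates.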

To screen the whole spectrum of resolution levels of the topological structure of any given network, we
must determine the values of $r_{min}$ and $r_{max}$ for the AFG model, and of $\gamma_{min}$ and $\gamma_{max}$ for the RB model,
which make the network appear as a single module or as a set of as many modules as nodes in the
network, respectively. The mathematical determination of these limits is discussed in the Appendix for the most general
case of directed and signed networks. The screening of the mesoscale is done by optimizing
$Q_{AFG}(r)$ and $Q_{RB}(\gamma)$ for the different values of $r$ and $\gamma$, respectively.
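The screening procedure can be illustrated on a toy network small enough for exhaustive search. This is only a sketch: the paper relies on optimization heuristics, since enumerating all partitions is infeasible beyond a handful of nodes. The names `partitions` and `screen`, the six-node network, and the restriction to non-negative weights with $r \geq 0$ are all assumptions made for the example.

```python
import numpy as np

def modularity(W, labels):
    """Weighted modularity for a symmetric, non-negative weight matrix."""
    labels = np.asarray(labels)
    strengths = W.sum(axis=1)
    total = strengths.sum()
    same = labels[:, None] == labels[None, :]
    return ((W - np.outer(strengths, strengths) / total) * same).sum() / total

def partitions(n):
    """Yield every partition of n nodes as a list of community labels."""
    def grow(labels, n_used):
        if len(labels) == n:
            yield list(labels)
            return
        # a node may join an existing community or open a new one
        for c in range(n_used + 1):
            labels.append(c)
            yield from grow(labels, max(n_used, c + 1))
            labels.pop()
    yield from grow([], 0)

def screen(W, r_values):
    """Number of communities of the optimal partition for each self-loop r."""
    counts = {}
    for r in r_values:
        Wr = W + r * np.eye(len(W))
        best = max(partitions(len(W)), key=lambda p: modularity(Wr, p))
        counts[r] = len(set(best))
    return counts

# Toy network (hypothetical): two triangles joined by a single link.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

counts = screen(W, [0.0, 5.0])
print(counts)
```

For this network the optimum at $r = 0$ consists of the two triangles, and at $r = 5$ every off-diagonal term of the modularity matrix is negative, so the optimum is the partition into six isolated nodes.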



3. Results 

To show the ability of multiresolution community detection methods to solve the problem of unsupervised
data clustering, we have chosen to study the classical benchmark of the Iris data set. This data set, presented
by Sir R.A. Fisher in 1936, consists of 50 samples from each of three species of Iris flowers (Iris setosa,
Iris versicolor and Iris virginica). We know the petal length, petal width, sepal length and sepal width
of each sample. For the moment, we will ignore the species information and cluster the data
using only the raw measurements as in [Fisher, 1936]. When this is done, the obtained clusters can be
compared with the real classification in order to evaluate their quality.

Fig. 1. Two principal components of the PCA analysis on the Iris data set. Color correspondence: setosa-blue,
versicolor-red, and virginica-green. While setosa is clearly linearly separable, the other two species are not.

Following the steps of data clustering explained above, we first perform a principal component analysis
of the four features that form each pattern, and choose to work with the two principal components
corresponding to the largest part of the data variance. A representation of these two components is
shown in Fig. 1. Based on these two variables, we build a similarity matrix from the Euclidean distances between
patterns, measured with respect to the average distance in this space. For any pair of flowers $i$ and $j$, we
define the similarity $S_{ij} = \bar{d} - \| \mathbf{x}_i - \mathbf{x}_j \|$, where $\bar{d}$ stands for the average distance of the set, and $\| \cdot \|$ is the
Euclidean distance between the feature vectors of each flower. The resulting similarity matrix is interpreted
as a weighted network whose communities will, in principle, reproduce the right clustering of the data.
Note that this matrix has positive and negative links, and that modularity should account for these signed
values; see the Appendix.
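The construction of the signed similarity network can be sketched with NumPy alone. Several details below are assumptions rather than statements from the paper: the PCA is computed through a singular value decomposition of the centered data, the average distance is taken over distinct pairs, the diagonal is set to zero, and random data stand in for the Iris measurements.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their leading principal components."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def similarity_matrix(Y):
    """S_ij = (average pairwise distance) - ||y_i - y_j||, with zero diagonal."""
    diff = Y[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(Y), k=1)   # each distinct pair once
    S = dist[upper].mean() - dist
    np.fill_diagonal(S, 0.0)               # assumption: no self-similarity
    return S

# Placeholder data (hypothetical): 30 patterns with 4 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = pca_project(X, n_components=2)
S = similarity_matrix(Y)
print(S.shape, (S > 0).sum(), (S < 0).sum())
```

Because the average distance is subtracted, pairs closer than average receive positive weights and pairs farther than average receive negative ones, which is why a signed formulation of modularity is required.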

We present the comparison of the results obtained using the algorithms described above, and also
compare them with the solution obtained by applying a classical hierarchical clustering technique; see Fig. 2.

In particular, we constructed the hierarchical clustering using complete linkage, where the distance
between groups is defined as the distance between the most distant pair of individuals, one from each group.
In other words, the distance between two clusters is given by the value of the longest link between the
clusters. At each stage of hierarchical clustering, the clusters at minimum distance are merged. Moreover,
instead of using the standard pair-group hierarchical clustering approach, we take advantage of a recent
development by some of the authors [Fernández & Gómez, 2008] that solves the non-uniqueness
problem arising when there are tied distances during the agglomeration process (code available at [Gómez, 2010]).
The result, known as a multidendrogram, is presented in Fig. 2a. We plot the tag number of each specimen
at the leaves of the tree. The analysis of the multidendrogram can be performed as follows: starting from
the root of the tree, we can compute the distances between different partitions of the data and analyze
each of them separately.
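The complete linkage rule described above can be sketched as a naive agglomeration loop. This is the standard pair-group variant, not the multidendrogram algorithm: ties are broken by taking the first minimum found, which is exactly the non-uniqueness the multidendrogram approach is designed to avoid. The function name and the one-dimensional example points are hypothetical.

```python
import numpy as np

def complete_linkage(dist, n_clusters):
    """Merge clusters until n_clusters remain, using complete linkage.

    dist : symmetric (N, N) array of pairwise distances
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # distance between groups = their most distant pair
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Example points on a line (hypothetical), chosen so that no distances tie
# at the merging steps.
points = np.array([0.0, 1.0, 10.0, 12.0, 30.0])
dist = np.abs(points[:, None] - points[None, :])
three = complete_linkage(dist, 3)
two = complete_linkage(dist, 2)
print(three, two)
```

Cutting the hierarchy at three clusters groups the points as {0, 1}, {10, 12} and {30}; at two clusters the first two groups merge, so the partitions are hierarchically nested.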

The comparison between the three methods can be done by computing the multiple scales of the
topology in terms of community structure, screening the values of $r$ in the AFG method, the values of $\gamma$ in
the RB method, and the distances in the dendrogram. In Fig. 2b we present the whole mesoscale for the
AFG method; we observe the persistence of the partitions in two and in three clusters
as the most representative of the mesoscale. In Fig. 2c we present a portion of the mesoscale for the RB
method. The same observation holds for this method; however, the variations of $\gamma$ do not ensure a
monotonic behavior of the number of clusters as a function of $\gamma$ (see Appendix for details). Finally, we plot
the mesoscale in terms of distances in the dendrogram, see Fig. 2d. The hierarchical clustering approach
also defines two main resolution levels, corresponding to the partitions in two and three clusters, respectively. The
fact that the partition that divides the data in two communities is always the most relevant in all the
methods used corresponds to the true partition of the Iris data set in two linearly separable sets.

Fig. 2. Mesoscales of the Iris data set, showing the number of clusters as a function of the resolution parameter: a) Complete
linkage multidendrogram; b) AFG mesoscales; c) RB mesoscales; d) HC mesoscales from the previous multidendrogram.

Fig. 3. Comparison between the three methods used in the classification of the Iris data set. Two measures are used: the
success ratio (left) and the Jaccard index (right). Only the partitions with highest performance and fewer than five clusters are
shown.

We define two measures to compare the different methods, focusing
on the most relevant partitions in terms of the scale length, see Fig. 3. The first measure is the success,
which is computed as the percentage of correctly classified nodes when comparing the partition obtained
with the original taxonomic classification made by biologists using more features of the flowers. In this
case, and for the partition in three clusters, both the HC and AFG methods achieve a success of 94.67%,
corresponding to a mismatch of eight flowers in total. The RB method obtains a success of 90.67% in
this case. The second measure we consider is the Jaccard index presented in [Jaccard, 1912], which is
the fraction of pairs of patterns in the same cluster in one partition which are also in the same cluster
in the other partition. The larger the fraction of same-cluster co-occurrences, the better the
agreement. In Fig. 3 (right) we observe that the best classification in three clusters is performed by the
AFG method by a slight difference (0.8194 for the AFG method versus 0.8180 for HC).
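Both measures are simple to compute from two label lists. The sketch below makes two assumptions not spelled out in the text: the success ratio matches clusters to species through the best one-to-one assignment, and the Jaccard index uses the usual pair-counting form (pairs grouped together in both partitions divided by pairs grouped together in at least one). The mislabeled flowers in the example are invented purely to reproduce a mismatch of eight.

```python
from itertools import combinations, permutations

def success_ratio(true_labels, found_labels):
    """Fraction of items correctly classified under the best cluster-to-class match."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(found_labels))
    # pad with None so that surplus clusters can stay unmatched
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[f] == t for t, f in zip(true_labels, found_labels))
        best = max(best, hits)
    return best / len(true_labels)

def jaccard_index(a, b):
    """Pair-counting Jaccard index between two partitions of the same items."""
    both = either = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        both += same_a and same_b
        either += same_a or same_b
    return both / either if either else 1.0

# Hypothetical outcome: 3 versicolor and 5 virginica flowers swapped.
true = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
found = [0] * 50 + [1] * 47 + [2] * 3 + [2] * 45 + [1] * 5

success = success_ratio(true, found)
jaccard = jaccard_index(true, found)
print(f"{100 * success:.2f}%", jaccard)
```

With eight of the 150 flowers misassigned, the success is $142/150 \approx 94.67\%$, which matches the arithmetic quoted for the HC and AFG methods. The brute-force matching is only practical for the small numbers of clusters considered here.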



4. Conclusions 

We have presented the adaptation and performance of two multiresolution methods for the determination
of the community structure in networks, applied to the problem of unsupervised data clustering. We focus on
the determination of groups in the similarity matrix using modularity as the quality function. We have
analytically computed the two limiting cases for the AFG method, which range from the classification of
the set as a single cluster to the classification of every data point as its own cluster. The results on the
classical Iris data set are competitive with classical unsupervised clustering techniques. These results are
encouraging, and indicate that the mapping of clustering problems to the structural analysis of networks is a
field worth exploring.
Appendix A Determination of AFG mesoscales boundaries

The generalization of modularity Eq. (1) for undirected weighted signed networks (see [Gómez et al., 2009])
is

$$Q = \frac{1}{2w^+ + 2w^-} \sum_i \sum_j \left[ w_{ij} - \left( \frac{w_i^+ w_j^+}{2w^+} - \frac{w_i^- w_j^-}{2w^-} \right) \right] \delta(C_i, C_j) ,$$

where, writing $w_{ij}^+ = \max(0, w_{ij})$ and $w_{ij}^- = \max(0, -w_{ij})$, the quantities $w_i^{\pm} = \sum_j w_{ij}^{\pm}$ are the positive
and negative strengths of node $i$, and $2w^{\pm} = \sum_i w_i^{\pm}$ are
the positive and negative total strengths respectively. Please note that these four strengths are defined to
be non-negative.

To simplify the notation, we make use of the modularity matrix

$$B_{ij} = w_{ij} - \left( \frac{w_i^+ w_j^+}{2w^+} - \frac{w_i^- w_j^-}{2w^-} \right) .$$

Following [Arenas et al., 2008], the analysis of the mesoscale is performed with the addition of a
common self-loop to all the nodes in the network. The boundaries of the mesoscale are the macroscale, a
partition in which all nodes belong to the same community, and the microscale, a partition in which each
node is isolated in its own community. The determination of these boundaries is equivalent to finding two
values of the self-loops, $r_{min}$ and $r_{max}$, for which the maximum of modularity $Q_{AFG}(r)$ is achieved at the
macroscale and microscale respectively. The solution is quite simple: if all the non-diagonal terms of the
modularity matrix are positive or zero, modularity is optimized at the macroscale, and if they are negative,
it is optimized at the microscale. Diagonal terms are irrelevant since $\delta(C_i, C_i) = 1$ for all nodes.

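If we introduce a positive self-loop $r^+$, the modularity matrix becomes

$$B_{ij}^{AFG}(r^+) = w_{ij} + r^+ \delta_{ij} - \left( \frac{(w_i^+ + r^+)(w_j^+ + r^+)}{2w^+ + N r^+} - \frac{w_i^- w_j^-}{2w^-} \right) .$$

The existence of $r_{max}$ is straightforward, since $B_{ij}^{AFG}(r^+) \sim -r^+/N < 0$ for large enough $r^+$ and $i \neq j$. Its
determination is just an exercise of solving the system of inequalities $B_{ij}^{AFG}(r^+) < 0$ for $i < j$, and taking
the smallest solution as $r_{max}$. More precisely,

$$r_{max} = \max_{i<j} \left( -\frac{D_{ij}}{2} + \frac{1}{2} \sqrt{D_{ij}^2 - 4 E_{ij}} \right) , \tag{A.10}$$

where

$$D_{ij} = w_i^+ + w_j^+ - N \left( w_{ij} + \frac{w_i^- w_j^-}{2w^-} \right) , \tag{A.11}$$

$$E_{ij} = w_i^+ w_j^+ - 2w^+ \left( w_{ij} + \frac{w_i^- w_j^-}{2w^-} \right) . \tag{A.12}$$

In the same way, $B_{ij}^{AFG}(-r^-) \sim r^-/N > 0$ proves the existence of $r_{min}$, and it is calculated by solving
$B_{ij}^{AFG}(-r^-) > 0$ for $i < j$, and taking the largest solution

$$r_{min} = -\max_{i<j} \left( -\frac{D_{ij}}{2} + \frac{1}{2} \sqrt{D_{ij}^2 - 4 E_{ij}} \right) , \tag{A.13}$$

where

$$D_{ij} = w_i^- + w_j^- + N \left( w_{ij} - \frac{w_i^+ w_j^+}{2w^+} \right) , \tag{A.14}$$

$$E_{ij} = w_i^- w_j^- + 2w^- \left( w_{ij} - \frac{w_i^+ w_j^+}{2w^+} \right) . \tag{A.15}$$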
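The boundaries follow from a quadratic inequality in the self-loop: multiplying $B_{ij}^{AFG}(r^+) < 0$ by the positive denominator $2w^+ + N r^+$ gives $(r^+)^2 + D_{ij} r^+ + E_{ij} > 0$, whose larger root is the threshold for the pair $(i, j)$. The sketch below evaluates these thresholds numerically for an undirected signed network. It rests on assumptions that go beyond the text: a null-case term with zero total strength is treated as zero, pairs whose discriminant is negative are skipped because they impose no constraint, and the network is assumed to yield $r_{min} < 0 < r_{max}$. All function names and the three-point example are hypothetical.

```python
import numpy as np

def signed_strengths(W):
    """Positive and negative strengths w_i^+ and w_i^- (both non-negative)."""
    return np.where(W > 0, W, 0.0).sum(axis=1), np.where(W < 0, -W, 0.0).sum(axis=1)

def ratio(a, b):
    """a / b, with the convention that the term vanishes when b == 0."""
    return a / b if b > 0 else np.zeros_like(a)

def afg_modularity_matrix(W, r):
    """Signed modularity matrix after adding a self-loop r to every node."""
    Wr = W + r * np.eye(len(W))
    wp, wn = signed_strengths(Wr)
    return Wr - (ratio(np.outer(wp, wp), wp.sum()) - ratio(np.outer(wn, wn), wn.sum()))

def largest_root(D, E):
    """Largest threshold -D/2 + sqrt(D^2 - 4E)/2 over the constraining pairs."""
    disc = D ** 2 - 4 * E
    ok = disc >= 0                      # other pairs impose no constraint
    return np.max(-D[ok] / 2 + np.sqrt(disc[ok]) / 2)

def afg_bounds(W):
    """Return (r_min, r_max) following the pattern of Eqs. (A.10)-(A.15)."""
    n = len(W)
    wp, wn = signed_strengths(W)
    tp, tn = wp.sum(), wn.sum()         # 2w+ and 2w-
    i, j = np.triu_indices(n, k=1)      # pairs with i < j
    c = W[i, j] + ratio(wn[i] * wn[j], tn)
    r_max = largest_root(wp[i] + wp[j] - n * c, wp[i] * wp[j] - tp * c)
    c = W[i, j] - ratio(wp[i] * wp[j], tp)
    r_min = -largest_root(wn[i] + wn[j] + n * c, wn[i] * wn[j] + tn * c)
    return r_min, r_max

# Signed similarity network (hypothetical) built from three points on a line.
points = np.array([0.0, 1.0, 5.0])
dist = np.abs(points[:, None] - points[None, :])
W = dist[np.triu_indices(3, k=1)].mean() - dist
np.fill_diagonal(W, 0.0)

r_min, r_max = afg_bounds(W)
print(r_min, r_max)
```

A useful sanity check is to evaluate the modularity matrix just beyond each boundary: slightly above $r_{max}$ all off-diagonal terms should be negative, and slightly below $r_{min}$ all should be positive.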

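When the network is directed, the analysis of the AFG mesoscale is exactly the same, but with the
substitutions

$$w_i^{\pm} \to w_i^{\pm,out} = \sum_k w_{ik}^{\pm} , \tag{A.16}$$

$$w_j^{\pm} \to w_j^{\pm,in} = \sum_k w_{kj}^{\pm} , \tag{A.17}$$

$$D_{ij} \to \frac{1}{2} (D_{ij} + D_{ji}) , \tag{A.18}$$

$$E_{ij} \to \frac{1}{2} (E_{ij} + E_{ji}) , \tag{A.19}$$

where $w_{ij}^{+}$ and $w_{ij}^{-}$ denote the positive and negative parts of the weights.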



Appendix B Boundaries of RB mesoscales 

In the RB formulation of mesoscales, a parameter $\gamma$ is introduced in front of the null-case term to weight
its relative importance against the real network, i.e.

$$B_{ij}^{RB}(\gamma) = w_{ij} - \gamma \left( \frac{w_i^+ w_j^+}{2w^+} - \frac{w_i^- w_j^-}{2w^-} \right) . \tag{B.1}$$

It is also possible to have different parameters for the positive and negative null-case terms as in
[Traag & Bruggeman, 2009]. However, this leads to a two-dimensional analysis of the mesoscales, which is
almost unaffordable for most real networks. Thus, we will focus on the single-parameter RB modularity
matrix Eq. (B.1).

Without negative weights, the macroscale is recovered at $\gamma_{min} = 0$, and the microscale at the $\gamma_{max}$
which makes all modularity terms negative. The existence of $\gamma_{max}$ is guaranteed by the fact that all null-case
terms are positive. However, the addition of negative weights makes it possible to have both positive and
negative null-case terms, so the recovery of the macroscale and microscale cannot be ensured. Therefore,
RB signed modularity may not cover the whole mesoscale. This is experimentally confirmed in Fig. B.1 for
the Iris data set, where a larger interval of the $\gamma$ parameter has been analyzed. While Fig. 2c only shows
the useful part of the mesoscales range, where the number of clusters goes from 2 to 73 ($\gamma \in [0.0, 4.2]$),
Fig. B.1 shows the inability of RB to find the macroscale (microscale) for lower (larger) values of $\gamma$.

Fig. B.1. Expanded Iris data set RB mesoscales analysis.
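For networks without negative weights, the microscale boundary can be written in closed form. The expression used below is derived here rather than quoted from the text: requiring $w_{ij} - \gamma w_i w_j / 2w < 0$ for every pair $i \neq j$ gives $\gamma > 2w \, w_{ij} / (w_i w_j)$, so the threshold is the maximum of that ratio over all pairs. The sketch assumes every node has positive strength; the function names and the toy network are hypothetical.

```python
import numpy as np

def rb_modularity_matrix(W, gamma):
    """B_ij(gamma) = w_ij - gamma * w_i w_j / 2w for non-negative weights."""
    W = np.asarray(W, dtype=float)
    s = W.sum(axis=1)
    return W - gamma * np.outer(s, s) / s.sum()

def rb_gamma_max(W):
    """Threshold above which every off-diagonal term of B(gamma) is negative."""
    W = np.asarray(W, dtype=float)
    s = W.sum(axis=1)                       # strengths, assumed all positive
    off = ~np.eye(len(W), dtype=bool)       # exclude diagonal terms
    return (s.sum() * W / np.outer(s, s))[off].max()

# Toy network (hypothetical): two triangles joined by a single link.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

gamma_max = rb_gamma_max(W)
print(gamma_max)
```

For this network the binding pair is a link between two nodes of strength 2, giving $\gamma_{max} = 14 / 4 = 3.5$. With negative weights the null-case term can change sign and no such threshold is guaranteed, in line with the behavior reported for the Iris data set.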